
Briefings in Bioinformatics

Oxford University Press (OUP)

Preprints posted in the last 7 days, ranked by how well they match the content profile of Briefings in Bioinformatics, based on 326 papers previously published here. The average preprint has a 0.25% match score for this journal, so anything above that is already an above-average fit.

1
Structure-aware graph attention based hierarchical transformer framework for drug-target binding affinity prediction

Kaira, V. S.; Kudari, Z. D.; P, S. S.; Bhat, R.; G, J.

2026-04-22 bioinformatics 10.64898/2026.04.19.719524 medRxiv
Top 0.1%
22.5%

Drug-target interaction prediction is significant in the hit identification phase of drug discovery, enabling the identification of potential drug candidates for downstream optimization. Traditional computational methods are limited in their ability to represent 3D structural data for both molecules and target proteins, which is required to capture the intricate protein-ligand interactions that regulate binding affinity. Here we propose a graph transformer-based model (GTStrDTI) that combines an intragraph attention mechanism with cross-modal attention to enrich the representations of both the drug molecule and the target protein. This approach comprehensively models both intramolecular structural features and intermolecular interactions, thereby enhancing binding affinity prediction performance. A thorough evaluation on benchmark datasets such as KIBA, DAVIS, and BindingDB_Kd shows that our approach surpasses state-of-the-art methods under challenging target cold-start settings. Our analysis found that augmenting the model with graph-based 3D protein target structure (C-alpha contact graphs from the PDB with a 5 Å distance threshold) and incorporating molecule adjacency information boosts predictive performance, thus contributing towards narrowing the gap between computational and experimental research.
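The C-alpha contact-graph construction mentioned above (residues connected when their C-alpha atoms lie within a 5 Å distance threshold) can be sketched in a few lines of Python; `contact_graph` and the toy coordinates are illustrative, not the authors' code:

```python
import math

def contact_graph(ca_coords, cutoff=5.0):
    """Build a C-alpha contact graph: residues i and j are connected
    when their C-alpha atoms lie within `cutoff` angstroms; sequential
    neighbours (|i - j| < 2) are skipped as trivial backbone contacts."""
    edges = set()
    for i, a in enumerate(ca_coords):
        for j in range(i + 2, len(ca_coords)):  # skip i+1 backbone neighbour
            if math.dist(a, ca_coords[j]) <= cutoff:
                edges.add((i, j))
    return edges

# Toy coordinates: residue 3 folds back near residue 0, others are far apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (-1.0, 2.0, 0.0)]
print(contact_graph(coords))  # {(0, 3)}
```

A real pipeline would parse C-alpha coordinates from PDB files; here they are hard-coded to keep the sketch self-contained.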

2
Benchmarking single-cell foundation models for real-world RNA-seq data integration

Han, S.; Sztanka-Toth, T.; Senel, E.; Elnaggar, A.; Patel, J.; Mansi, T.; Smirnov, D.; Greshock, J.; Javidi, A.

2026-04-21 bioinformatics 10.64898/2026.04.17.719314 medRxiv
Top 0.3%
12.0%

Single-cell foundation models enable reusable representations and streamlined analysis workflows, yet rigorous evaluation of their performance and robustness in real-world pharmaceutical settings remains underexplored. Here, we benchmarked leading single-cell foundation models (scGPT; scGPT_CP, a continually pretrained checkpoint of scGPT; scFoundation; scMulan; CellFM) against established baseline methods (scVI; Harmony) for data integration using over 1.5 million cells from clinical and preclinical samples. Performance was assessed using well-established and complementary metrics for technical correction and biological structure preservation. We further introduced robustness-oriented rankings to summarize metric trade-offs and quantify performance consistency across datasets and evaluation settings. Our findings show that fine-tuning improved technical correction performance; among the foundation models, fine-tuned scGPT_CP performed best. However, the baseline scVI was the top overall performer, ranking first by our multi-metric Leximax ranking and achieving the highest Pareto Front-1 hit rate. Collectively, our study provides practical insights for adapting foundation models to real-world drug design and development.
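A Pareto Front-1 "hit" in the sense used above is membership in the non-dominated set of methods across metrics. A minimal sketch, assuming higher is better for every metric and using hypothetical scores (not the paper's results):

```python
def pareto_front(scores):
    """Return the names of methods not dominated by any other method.
    `scores` maps method name -> tuple of metric values (higher is better).
    Method A dominates B if A >= B on every metric and > B on at least one."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return {m for m, s in scores.items()
            if not any(dominates(t, s) for n, t in scores.items() if n != m)}

# Hypothetical (technical-correction, bio-conservation) scores.
scores = {"scVI": (0.9, 0.8), "Harmony": (0.7, 0.9), "scGPT": (0.6, 0.7)}
print(sorted(pareto_front(scores)))  # ['Harmony', 'scVI']
```

With these toy numbers "scGPT" is dominated by "scVI", so only the other two sit on the first front; a Leximax ranking would then break ties among front members.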

3
From GWAS to drug: A framework for drug candidate prioritisation using a gene expression signature matching approach

Chauquet, S.; Jiang, J.-C.; Barker, L. F.; Hunter, Z. L.; Singh, G.; Wray, N. R.; McRae, A. F.; Shah, S.

2026-04-24 genetic and genomic medicine 10.64898/2026.04.22.26349470 medRxiv
Top 0.4%
9.9%

Drug targets supported by human genetic evidence have significantly higher approval rates, making genome-wide association studies a valuable resource for drug candidate prioritisation. Transcriptome-wide association study signature-matching is an emerging in silico approach that integrates GWAS data with expression quantitative trait loci to generate a disease gene expression signature, which is then compared against drug perturbation databases such as the Connectivity Map. Despite recent adoption, there is no consensus on optimal methodology. Here, we systematically benchmark key parameters, including TWAS method, eQTL tissue model, similarity metric, gene set size, and CMap cell line, using LDL cholesterol, familial combined hyperlipidemia, and asthma as proof-of-concept traits. We demonstrate that while TWAS signature-matching can successfully prioritise known first-line treatments, performance is highly sensitive to parameter choice; for instance, the selection of the cell line used for drug signatures alone can dramatically alter drug prioritisation. Based on these findings, we propose a best-practice framework for robust, genetically-informed drug prioritisation using TWAS signature-matching.
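Signature matching of the kind described above typically scores a drug by how strongly its perturbation signature anti-correlates with the disease signature (a drug that reverses the disease expression profile is a candidate). The sketch below uses plain Pearson correlation and made-up z-scores; it is not the paper's actual similarity metric:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Disease TWAS signature (per-gene z-scores) vs two drug signatures:
# an anti-correlated drug is predicted to reverse the disease state.
disease = [2.0, 1.5, -1.0, -2.0]
drug_a = [-1.8, -1.2, 0.9, 2.1]   # reverses the signature -> candidate
drug_b = [1.9, 1.4, -0.8, -2.2]   # mimics the signature -> deprioritized
print(pearson(disease, drug_a) < 0 < pearson(disease, drug_b))  # True
```

Real CMap-style scoring uses rank-based connectivity statistics over thousands of genes, but the sign convention is the same: the most negative scores are the repurposing candidates.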

4
BioAutoML-FAST: an automated machine-learning platform for reusable and benchmarked biological sequence models

Silva de Almeida, B. L.; Bonidia, R.; Bole, M.; Avila-Santos, A.; Stadler, P. F.; Nunes da Rocha, U.; de Carvalho, A. C. P. L. F.

2026-04-22 bioinformatics 10.64898/2026.04.18.719383 medRxiv
Top 0.9%
6.3%

The prediction of biological sequence properties has traditionally relied on alignment-based methods that assume evolutionary homology and depend on curated reference databases. This, in turn, limits scalability and sensitivity for large or heterogeneous datasets, remote homologs, short sequences, and rapidly evolving genomic regions. Although Machine-Learning (ML) approaches offer alignment-free alternatives, their broader adoption is limited by: (i) the lack of standardized, externally validated benchmark models across diverse datasets, and (ii) the technical expertise required for feature engineering, model selection, and evaluation. Automated machine learning (AutoML) alleviates these challenges by systematically optimizing representations and models with minimal user intervention. However, most existing frameworks prioritize task-specific model construction and lack mechanisms for preserving trained models as persistent, comparable benchmarks. We introduce BioAutoML-FAST, an end-to-end web platform for automated ML analysis of nucleotide and amino acid sequences. It supports both classification and regression tasks and automates feature extraction, model training, and evaluation without requiring prior user expertise. Uniquely, it serves as a community benchmarking resource, hosting a continuously expanding repository of reusable, standardized models (currently 60) for genomic, transcriptomic, and proteomic applications. Extensive validation on independent datasets demonstrates performance comparable to or exceeding that of state-of-the-art methods, including protein language models such as ESM-2. BioAutoML-FAST is available at https://bioautoml.icmc.usp.br/. This website is free and open to all users, and there is no login requirement.

5
Expanding P-NET, a multi-purpose biologically informed deep learning framework

Elmarakeby, H.; Glettig, M.; Zhou, A.; Zhou, C.; Tarantino, G.; Aprati, T.; Van Allen, E.; Liu, D.

2026-04-22 bioinformatics 10.64898/2026.04.19.719454 medRxiv
Top 0.9%
6.3%

We present expanded P-NET, a versatile framework for deep learning in computational biology based on P-NET, leveraging biological pathways for interpretable predictions. Our framework achieves competitive performance in genomic and transcriptomic prediction tasks. We demonstrate its stability and interpretability compared to traditional machine learning models. P-NET 2.0 incorporates gene and pathway information, providing valuable insights into complex biological processes. The framework is publicly available, enabling its application to various computational biology tasks.

6
Pan1c: a pipeline to easily build chromosome-level pangenome graphs

Mergez, A.; Racoupeau, M.; Bardou, P.; Linard, B.; Legeai, F.; Choulet, F.; Gaspin, C.; Klopp, C.

2026-04-21 bioinformatics 10.64898/2026.04.17.719212 medRxiv
Top 1%
6.2%

Advances in sequencing technologies and the availability of high-quality genome assemblies for many genotypes per species make it possible to improve sequence alignment rate and quality, as well as variant calling accuracy, by including all genomic variations in a graph reference, called a pangenome graph. Because the process of building and analysing a pangenome graph is still complex, with related software packages under development, there is an important need for user-friendly pipelines in this emerging research area. Pan1C is a pipeline based on a chromosome-by-chromosome graph construction strategy. It integrates two complementary strategies for building pangenomes and produces informative metric plots and graphics using a large set of tools. By benchmarking Pan1C on human, fungal, and wheat assemblies, which span a wide range of genome sizes and complexities, we demonstrated the value of Pan1C for assembly and graph validation as well as for primary analyses.

7
Reveal Principles of Codon Optimization via Machine Learning

Deng, F.; Li, H.; Sun, D.; Duan, G.; Sun, Z.; Xue, G.

2026-04-21 bioinformatics 10.64898/2026.04.16.718958 medRxiv
Top 1%
4.9%

High levels of protein expression are usually desirable in industry and research, and codon optimization is widely used to achieve high expression. Methods for codon optimization fall into two branches: classical methods, which develop cost functions based on empirical rules, and AI methods, which learn codon choice principles from endogenous genes with neural networks. Here we develop two codon optimization tools, one from each branch, namely OptimWiz 2.1 and OptimWiz 3.0. Results of fusion protein fluorescence detection indicate that both OptimWiz 2.1 and OptimWiz 3.0 are superior to all the other commercially available codon optimization tools. Principles of codon optimization are revealed in the process of machine learning with both tools.
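Classical cost functions of the kind mentioned above often build on the Codon Adaptation Index (CAI): the geometric mean of per-codon relative-adaptiveness weights derived from highly expressed endogenous genes. The weight table below is a toy example, not a real organism's:

```python
import math

# Toy relative-adaptiveness weights (w) for codons of two amino acids;
# real tables are estimated from highly expressed endogenous genes.
W = {"CTG": 1.0, "CTA": 0.2,   # Leu
     "AAA": 1.0, "AAG": 0.6}   # Lys

def cai(codons):
    """Codon Adaptation Index: geometric mean of the codon weights,
    computed in log space for numerical stability."""
    return math.exp(sum(math.log(W[c]) for c in codons) / len(codons))

print(round(cai(["CTG", "AAA"]), 3))  # 1.0 (all-optimal coding)
print(round(cai(["CTA", "AAG"]), 3))  # 0.346 (geometric mean of 0.2, 0.6)
```

A classical optimizer simply rewrites each amino acid with its highest-weight synonymous codon, maximizing this score; AI methods instead learn the codon distribution directly from data.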

8
Foundation cell segmentation models performance on live microscopy and spatial-omics data

Miao, Y.; Surguladze, N.; Lerner, J.; Poysungnoen, K.; Ariano, K.; Li, Y.; Zhu, Y.; Van Batavia, K.; Jepson, J.; Van De Klashorst, J.; Ni, B. Y. X.; Armstrong, A.; Rahman, R.; Horstmeyer, R.; Hickey, J. W.

2026-04-21 bioinformatics 10.64898/2026.04.18.719315 medRxiv
Top 1%
4.9%

Accurate cell segmentation is an essential step for quantitative analysis of biological imaging data. Recent advances in deep learning have led to the development of generalist segmentation models that perform robustly across multiple imaging modalities, including label-free phase contrast, fluorescence cell culture, and multiplexed fluorescence tissue imaging. However, systematic comparisons of these models at the level of downstream biological analysis remain limited. To address this gap, we evaluated several recent segmentation models, including Cellpose cyto3, Cellpose-SAM, μSAM, and CellSAM, on phase contrast and fluorescence cell culture images. In addition, Mesmer and InstanSeg were included for benchmarking on multiplexed fluorescence tissue images generated using CO-Detection by IndEXing (CODEX). We found that Cellpose-SAM achieved strong performance on phase contrast images, while SAM-based models consistently performed well on fluorescence cell culture data. In contrast, no single model consistently outperformed others on CODEX datasets. Instead, each model exhibited distinct strengths and limitations, which led to differences in downstream analyses, including clustering and cell type identification. Together, our study emphasizes the importance of selecting segmentation models based on dataset characteristics and analytical goals, rather than relying on a single universal approach.

9
CRISP enables comparisons of image-based spatial transcriptomics segmentation quality across ten organs

Rose, J. R.; Rose, E. S.; Assumpcao, J. A. F.; Pathak, H.; Peck, H. E.; Sasser, L. E.; Patel, C. J.; Vanover, D.; Santangelo, P. J.

2026-04-21 bioinformatics 10.64898/2026.04.16.718947 medRxiv
Top 1%
4.9%

Image-based spatial transcriptomics depends on cell segmentation to assign transcripts to individual cells, but how segmentation algorithms perform across tissues with distinct cellular architectures is poorly understood. This study presents the broadest independent benchmark to date of cell segmentation algorithms for spatial transcriptomics, comparing five approaches across ten mouse tissues using a 5,006-gene Xenium panel. To quantify segmentation errors, Co-expression Rejection in Segmentation Purity (CRISP) was developed, an open-source tool available in R and Python that measures cell purity through tissue-specific mutually exclusive marker co-expression without requiring ground truth annotations. This benchmark revealed that segmentation algorithms face a fundamental tradeoff between maximizing transcript capture and maintaining cell purity, and that the severity of this tradeoff is tissue-dependent. Proseg achieved the highest average performance across tissues, though the magnitude of its advantage varies with tissue architecture. Overall, CRISP provides per-tissue performance profiles as a practical resource for algorithm selection.
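The purity idea behind CRISP, flagging cells that co-express mutually exclusive markers as likely segmentation errors, can be sketched as follows; `impure_fraction`, the marker pair, and the counts are illustrative, not the released tool:

```python
def impure_fraction(cells, exclusive_pairs, min_count=1):
    """Fraction of cells co-expressing any mutually exclusive marker pair.
    `cells` is a list of {gene: transcript count} dicts; co-expression of
    markers from distinct cell types suggests transcripts were mis-assigned."""
    def impure(cell):
        return any(cell.get(a, 0) >= min_count and cell.get(b, 0) >= min_count
                   for a, b in exclusive_pairs)
    return sum(impure(c) for c in cells) / len(cells)

cells = [{"Cd3e": 5},              # T cell, pure
         {"Cd3e": 4, "Cd19": 3},   # T/B marker mix -> likely boundary error
         {"Cd19": 6}]              # B cell, pure
print(round(impure_fraction(cells, [("Cd3e", "Cd19")]), 3))  # 0.333
```

The tissue-specific part of the real method is choosing which marker pairs count as mutually exclusive; the arithmetic on top of that choice is this simple.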

10
Closed-Loop Multi-Objective Optimization for Receptor-Selective Cell-Penetrating Peptide Design

Yamahata, I.; Shimamura, T.; Hayashi, S.

2026-04-21 bioinformatics 10.64898/2026.04.16.718169 medRxiv
Top 1%
4.5%

Cell-penetrating peptides (CPPs) can deliver diverse cargos into cells. However, designing CPPs with receptor-selective interaction profiles remains difficult because interactions with individual cell-surface components cannot be tuned independently. Here, we developed a closed-loop in silico framework for receptor-selective CPP design, in which receptor interactions are formulated as explicit objectives in a multi-objective optimization problem. We first constructed a CPP-like candidate library using a sequence generative model fine-tuned on known CPPs. The framework then evaluated candidate peptides by receptor-wise docking, molecular dynamics simulations, and MM/GBSA to compute receptor-wise binding scores. These scores were used iteratively to propose subsequent candidates by multi-objective Bayesian optimization. Applied to a CXCR4/NRP1 design setting, the framework identified candidates with more favorable predicted interaction profiles, characterized by higher CXCR4 binding scores and lower NRP1 binding scores. We selected 10 peptides from the computationally identified candidates for cell-based imaging and found that 4 showed higher enrichment in CXCR4-positive regions than in NRP1-positive regions under the tested conditions. These results show that the proposed framework provides a practical in silico approach for designing CPPs with receptor-selective interaction profiles.

11
Protein inverse folding through joint modeling of surface and backbone geometry

Hong, Y.; Cai, Y.; Jiao, Y.; Qi, M.; Huang, Q.; Sun, L.

2026-04-22 bioinformatics 10.64898/2026.04.20.719544 medRxiv
Top 1%
4.4%

Inverse protein folding aims to generate amino acid sequences compatible with a given protein structure. While recent deep learning methods have achieved strong performance by conditioning on residue-level backbone geometry, backbone-only representations insufficiently constrain surface-exposed residues and thus incompletely capture the structural determinants of sequence identity. Here we propose Surleton, a structure-aware inverse folding framework that jointly models backbone geometry and protein surface organization. By integrating complementary surface geometric information, Surleton refines the conditional sequence distribution and improves the balance of sequence modeling across buried and exposed residues. On the CATH4.2 and SCOPe benchmarks, Surleton consistently outperforms backbone-only baselines in sequence recovery, sequence similarity, and predictive confidence, with especially strong improvements on surface-exposed residues. Together, these findings indicate that protein surface geometry serves as a complementary source of structural constraint and that surface-aware modeling may provide a promising direction for improving inverse protein folding.

12
Protocol for constructing correlation-based molecular networks from large-scale untargeted metabolomics data

Lin, H.; Zhang, L.; Lotfi, A.; Jarmusch, A.; Lee, I.; Kim, A.; Morton, J.; Aksenov, A. A.

2026-04-21 bioinformatics 10.1101/2025.04.26.649581 medRxiv
Top 2%
4.0%

This protocol describes a computational approach for constructing correlation-based molecular networks from untargeted metabolomics data using MetVAE, a variational autoencoder-based framework. Complementing spectral similarity networks, it captures functional relationships reflected in cross-sample correlations. The workflow imports metabolomics features and sample metadata, adjusts for compositionality, missingness, confounding, and high-dimensionality, estimates sparse metabolite correlations, and exports GraphML files for network visualization. In a hepatocellular carcinoma mouse model, it links lipid classes in high-fat-diet animals, suggesting an endogenous "auto-brewery" route to lipotoxic metabolites.
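The core of any correlation-based molecular network, thresholding pairwise across-sample correlations into an edge list, can be sketched in plain Python; MetVAE's variational model and compositional adjustments are not reproduced here, and the feature names are made up:

```python
def correlation_edges(matrix, names, threshold=0.7):
    """Edge list of feature pairs whose across-sample Pearson correlation
    exceeds `threshold` in absolute value (rows = features, cols = samples)."""
    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)
    return [(names[i], names[j], round(pearson(matrix[i], matrix[j]), 2))
            for i in range(len(matrix)) for j in range(i + 1, len(matrix))
            if abs(pearson(matrix[i], matrix[j])) > threshold]

# Three features over four samples; the first two co-vary perfectly.
feats = [[1, 2, 3, 4], [2, 4, 6, 8], [5, 1, 4, 2]]
print(correlation_edges(feats, ["TG(52:2)", "TG(54:3)", "unrelated"]))
```

The resulting edge list is what would be serialized to GraphML for visualization; MetVAE's contribution is producing correlations that are robust to compositionality before this thresholding step.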

13
Bi-level diversity optimisation for representative protein panel selection

Ou, Z.; James, K.; Charnock, S.; Wipat, A.

2026-04-21 bioinformatics 10.64898/2026.04.17.719243 medRxiv
Top 2%
4.0%

Selecting representative subsets from large protein sequence datasets is a common challenge in enzyme discovery and related tasks under limited screening capacity. In practice, candidate panels are often constructed using clustering-based redundancy reduction or manual selection guided by phylogenetic or similarity-network analyses, which do not directly optimise subset diversity and require threshold tuning or expert interpretation. Here, we present a bi-level diversity-optimisation framework for representative protein panel selection implemented using a local search heuristic that iteratively updates panel composition to improve diversity. The method formulates panel design as a combinatorial optimisation problem over pairwise distance matrices, combining a MaxMin objective to enforce minimum separation between selected sequences with a MaxSum objective to increase global dispersion. This formulation enables the direct construction of fixed-cardinality panels while remaining independent of the similarity representation used to compute pairwise distances. Benchmarking across four Pfam families shows that the bi-level formulation consistently reduces redundancy among selected sequences, lowering maximum pairwise identity by 43-46% relative to the previous MaxSum-based formulation, while maintaining comparable or improved EC-label coverage. The framework can incorporate sequence- or structure-based similarity measures, providing a flexible strategy for constructing diverse representative panels across homologous protein families.
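The bi-level objective described above, MaxMin as the primary criterion with MaxSum as the tiebreaker, lends itself to a small local-search sketch. This is an illustrative best-improvement swap heuristic over a toy distance matrix, not the authors' implementation:

```python
from itertools import combinations

def diversity(panel, dist):
    """Bi-level objective: (min pairwise distance, total pairwise distance)."""
    pairs = list(combinations(sorted(panel), 2))
    return (min(dist[p] for p in pairs), sum(dist[p] for p in pairs))

def local_search(candidates, dist, k, iters=200):
    """Swap one member per round for the swap that lexicographically
    improves (MaxMin, MaxSum); stop when no swap helps."""
    panel = set(candidates[:k])
    for _ in range(iters):
        best, best_div = panel, diversity(panel, dist)
        for out in panel:
            for inc in set(candidates) - panel:
                trial = (panel - {out}) | {inc}
                d = diversity(trial, dist)
                if d > best_div:  # tuple comparison = lexicographic
                    best, best_div = trial, d
        if best == panel:
            break
        panel = best
    return panel

# Toy symmetric distances over four sequences; a-b and c-d are near-duplicates.
dist = {("a", "b"): 1, ("a", "c"): 5, ("a", "d"): 4,
        ("b", "c"): 4, ("b", "d"): 5, ("c", "d"): 1}
print(diversity(local_search(["a", "b", "c", "d"], dist, 2), dist))  # (5, 5)
```

Starting from the redundant panel {a, b}, the search swaps its way to a maximally separated pair, exactly the redundancy reduction the benchmark measures via maximum pairwise identity.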

14
Human-supervised Agentic AI for Hypothesis Generation and Experimental Assistance in Drug Repurposing

Huynh, D.-L.; Asp, E.; Ballante, F.; Puigvert, J. C.; DeGrave, A.; Karki, R.; Nader, K.; Östling, P.; Pokharel, B.; Rietdijk, J.; Schlotawa, L.; Schmidt, L.; Seal, S.; Seashore-Ludlow, B.; Aittokallio, T.; Spjuth, O.

2026-04-22 bioinformatics 10.64898/2026.04.20.719538 medRxiv
Top 2%
3.9%

Computational drug repurposing has largely been focused on rapid hypothesis generation, yet real-world applications span a far broader lifecycle, from drug candidate suggestion to designing experiments, analyzing assay data, and iteratively refining candidates. Here, we demonstrate that agentic AI can fulfill this entire scope. To this end, we developed RepurAgent, a hierarchical multi-agent AI system comprising a supervisor agent and a planning agent that coordinate four specialized sub-agents -- research, prediction, data, and report -- through a human-in-the-loop design, with episodic memory and retrieval-augmented generation. The system is grounded in data, tools, and standard operating procedures specific to drug repurposing, developed within the REMEDi4ALL consortium. We validated the agentic system across three scenarios spanning the various stages within the repurposing lifecycle: in Acute Myeloid Leukemia, RepurAgent recovered up to 97% of disease-relevant pathways identified by Google Co-Scientist, completing the workflow within 60 minutes; in a retrospective COVID-19 antiviral screen, RepurAgent acted as an adaptive experimental collaborator, prioritizing compounds with AUC-ROC up to 0.98 without predefined thresholds and flagging confounders missed in manual review; and for Multiple Sulfatase Deficiency, it prioritized 82 high-confidence candidates from 5000 compounds, which were further corroborated by domain experts. These results demonstrate that agentic AI can provide support across the full drug repurposing lifecycle, from hypothesis generation to experimental analysis. RepurAgent is open source and deployed at https://repuragent.serve.scilifelab.se/.

15
GNOMES: an integrated framework for genome-wide normalization and differential binding analysis of CUT&RUN and ChIP-seq data

Roule, T.; Akizu, N.

2026-04-21 bioinformatics 10.64898/2026.04.16.718722 medRxiv
Top 2%
3.6%

Background: Despite their widespread use, quantitative comparison of epigenomic datasets such as ChIP-seq and CUT&RUN remains challenging, particularly due to difficulties in signal normalization across samples and conditions. Normalization based solely on sequencing depth is often insufficient due to the high variability in signal-to-noise ratios across samples, even from the same experiment. While exogenous spike-in normalization can address some of these issues, robust spike-in controls are not always available and may introduce additional experimental burden and computational complexity. Furthermore, normalization and differential binding analysis are typically performed using separate bioinformatics tools. Indeed, most differential analysis frameworks operate on raw count matrices, preventing users from visually inspecting normalized signal tracks and evaluating how normalization influences the results. To overcome these challenges, we developed GNOMES (Genome-wide NOrmalization of Mapped Epigenomic Signals), a framework that integrates signal normalization, quality control, and differential binding analysis within a unified workflow.
Results: GNOMES is a user-friendly tool that processes ChIP-seq and CUT&RUN datasets from aligned reads and generates normalized coverage profiles and differential binding results. The tool implements a robust genome-wide normalization strategy based on percentile scaling of signal local maxima, enabling stable normalization between biological replicates and conditions. GNOMES supports both single- and paired-end sequencing, does not require a negative control (input or IgG), and can be applied to both broad (histone marks) and narrow (transcription factor) enrichment patterns. The workflow includes normalization, optional consensus peak identification, and differential binding analysis. For each step, GNOMES generates extensive quality-control metrics and visual outputs, including normalized bigWig tracks, median signal tracks, BED files of regions with significant changes, and diagnostic plots such as heatmaps and PCA. GNOMES is highly configurable and integrates established tools such as MACS2 for candidate peak region identification, as well as DESeq2 and edgeR for statistical testing. Finally, GNOMES is organism-agnostic and can be applied to epigenomic datasets from any model system.
Conclusions: GNOMES provides an integrated and highly customizable environment for normalization and differential binding analysis of epigenomic sequencing data. By integrating signal normalization with downstream statistical methods for differential binding analysis and comprehensive quality control, GNOMES simplifies the analysis of ChIP-seq and CUT&RUN datasets for the identification of chromatin changes.
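The percentile-scaling idea, dividing a coverage track by a percentile of its local maxima so replicates with different signal-to-noise land on a common scale, can be illustrated as follows. These are toy tracks and a simplified percentile rule, not the GNOMES algorithm verbatim:

```python
def local_maxima(signal):
    """Coverage values that exceed both immediate neighbours."""
    return [signal[i] for i in range(1, len(signal) - 1)
            if signal[i - 1] < signal[i] > signal[i + 1]]

def percentile_scale(signal, q=0.9):
    """Divide the track by the q-th percentile of its local maxima,
    putting replicates with different depths on a common scale."""
    peaks = sorted(local_maxima(signal))
    ref = peaks[int(q * (len(peaks) - 1))]  # nearest-rank percentile
    return [round(v / ref, 2) for v in signal]

rep1 = [0, 10, 0, 40, 0, 100, 0]
rep2 = [0, 5, 0, 20, 0, 50, 0]   # same peak shape at half the depth
print(percentile_scale(rep1) == percentile_scale(rep2))  # True
```

Because the reference value is taken from the track's own peak distribution rather than from total read counts, a replicate with twice the depth but the same enrichment pattern maps to the identical normalized profile.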

16
Systematic Benchmarking of Kinase Bioactivity Models Across Splitting Strategies and Protein Representations

Abbott, J. M.

2026-04-22 bioinformatics 10.64898/2026.04.20.719590 medRxiv
Top 2%
3.6%

Machine learning models for protein-ligand bioactivity prediction are increasingly used in computational drug discovery. However, reported benchmark performance is often sensitive to evaluation design. To further understand evaluation design strategies, we present a systematic evaluation of seven machine learning architectures for kinase inhibitor bioactivity prediction, spanning classical baselines (Random Forest, XGBoost, ElasticNet, multi-layer perceptron) and advanced neural approaches (Graph Isomorphism Network, ESM-2 protein embedding MLP, and a GNN-ESM fusion model). Using a curated ChEMBL-derived kinase activity dataset of 352,874 records across 507 human protein kinase targets, we evaluated all models under three splitting strategies of increasing stringency: random, scaffold-based (Bemis-Murcko), and target-held-out. We observed that Random Forest with Morgan fingerprints achieves near-equivalent or superior performance to all neural architectures under scaffold and target-based evaluation. On target-held-out splits, frozen ESM-2 embeddings showed worse generalization, with ESM-FP MLP exhibiting the largest performance degradation. Learned graph representations (GIN) do not outperform fixed 2048-bit ECFP4 fingerprints at this data scale, and tree-based uncertainty methods outperform the MC-Dropout implementations tested here on calibration and selective prediction metrics. A JAK kinase subfamily case study shows that protein-aware models achieved 79% top-1 selectivity accuracy versus 52% for pooled fingerprint models. However, stronger baselines using explicit target identity achieved 83-84%, indicating that ESM-2 embeddings in this study functioned primarily as an implicit target identifier. These results indicate that evaluation methodology and statistical rigor are major determinants of reported performance in bioactivity prediction.
Benchmark design overview (Figure 1): A curated ChEMBL kinase bioactivity dataset (352,874 records, 507 targets) was evaluated under three splitting strategies of increasing stringency. Seven model architectures spanning baselines, protein-aware, and graph neural approaches were each trained under 5-seed replication (105 total runs), with results analyzed across three complementary branches: the main 507-target benchmark, ESM-2 embedding ablation studies on a clean 92-target subset, and a JAK-family selectivity case study with stronger target-conditioned baselines.
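The splitting strategies compared above differ in what they hold out: random splits leak analogues between train and test, whereas scaffold- or target-held-out splits remove whole groups. A generic group-held-out split can be sketched as below, with `group_split` as a hypothetical helper and the records as toy (molecule, target) pairs:

```python
import random

def group_split(records, group_of, test_frac=0.2, seed=0):
    """Hold out whole groups (e.g. Bemis-Murcko scaffolds or protein
    targets) so nothing from a test group is ever seen in training --
    a stricter evaluation than a random record-level split."""
    groups = sorted({group_of(r) for r in records})
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(test_frac * len(groups)))
    held_out = set(groups[:n_test])
    train = [r for r in records if group_of(r) not in held_out]
    test = [r for r in records if group_of(r) in held_out]
    return train, test

records = [("mol1", "ABL1"), ("mol2", "ABL1"), ("mol3", "JAK2"),
           ("mol4", "JAK2"), ("mol5", "EGFR")]
train, test = group_split(records, group_of=lambda r: r[1])
print({t for _, t in train} & {t for _, t in test})  # set() -- disjoint targets
```

Swapping `group_of` from the target column to a scaffold function (e.g. RDKit's Bemis-Murcko scaffold) turns the same helper into a scaffold split; the stringency ordering in the abstract corresponds to how much chemistry or biology each grouping removes from training.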

17
Kernel Matrix Completion with Topological and Spectral Features for Multi-Modal Classification

Rinon, E. M.; Visaya, M. V.; Sambayan, R.

2026-04-22 bioinformatics 10.64898/2026.04.19.713528 medRxiv
Top 2%
3.6%

Kernel methods offer a robust framework for integrating multi-modal datasets into a unified representation, thereby facilitating more comprehensive data interpretation. In the presence of incomplete datasets, multiple kernel learning is employed to enhance the efficiency of data completion and integration. We investigate kernel-based approaches to address the incomplete-data problem with applications to yeast protein data. Biological data such as yeast proteins can be represented through multiple modalities, including gene expression profiles, amino acid sequences, three-dimensional structures, and protein interaction networks. We introduce a computational pipeline based on kernel matrix completion, in which topological data analysis (TDA) and persistent spectral analysis are incorporated into the classification setting. TDA captures geometric structure across scales while spectral descriptors reflect connectivity patterns through Laplacian eigenvalues. Kernel, topological, and spectral descriptors are used with support vector machines to discriminate between membrane and non-membrane yeast proteins. Empirical results show that the combined pipeline improves both kernel completion accuracy and ROC performance relative to baseline kernel-only approaches. The best-performing configuration achieves an ROC score of 0.8632 using the average of three kernels augmented with TDA features. These results demonstrate competitive performance relative to strong kernel-based baselines under incomplete data conditions. The proposed pipeline provides a unified approach for learning from incomplete heterogeneous data while enriching kernel representations with geometric and spectral information.

18
SPaCeD: Spatial Point Process Distances for Pairing the Heavy and Light Chains of B Cell Receptors from Spatial BCR-seq

Liu, Y.; Nathoo, F.; Laumont, C.; Kalaria, S.; Nelson, B.

2026-04-22 bioinformatics 10.64898/2026.04.19.719512 medRxiv
Top 2%
3.6%

B cells mediate anti-tumor immunity through B cell receptors (BCRs) composed of paired heavy and light chains. Leveraging spatial transcriptomics, we introduce SPaCeD, a framework that infers heavy-light chain pairs by integrating expression matrices with spatial distances between point patterns derived via optimal transport. Applied to ovarian and breast cancer datasets and simulation scenarios, SPaCeD improves pairing accuracy and stability compared with existing methodology (Repair), particularly for pairs with lower spatial expression.
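The optimal-transport distance between spatial point patterns that SPaCeD relies on reduces, for equal-size 1-D patterns, to matching sorted points; the sketch below illustrates only this special case, with made-up spot positions rather than real spatial BCR-seq data:

```python
def w1_distance(pattern_a, pattern_b):
    """1-Wasserstein (earth mover's) distance between two equal-size
    1-D point patterns: on a line, optimal transport between uniform
    point masses reduces to matching sorted points pairwise."""
    assert len(pattern_a) == len(pattern_b)
    return sum(abs(a - b) for a, b in
               zip(sorted(pattern_a), sorted(pattern_b))) / len(pattern_a)

heavy = [1.0, 2.0, 8.0]        # spatial positions of heavy-chain spots
light_near = [1.2, 2.1, 7.8]   # candidate light chain, co-localized
light_far = [5.0, 6.0, 9.0]    # candidate from elsewhere in the tissue
print(w1_distance(heavy, light_near) < w1_distance(heavy, light_far))  # True
```

In two dimensions the matching must be solved as a transport problem rather than by sorting, but the pairing logic is the same: the light chain whose point pattern is cheapest to transport onto the heavy chain's is the inferred partner.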

19
DNAharvester: A Nextflow Pipeline for Analysing Highly Degraded DNA from Ancient and Historical Specimens

Sharif, B.; Kutschera, V. E.; Oskolkov, N.; Guinet, B.; Lord, E.; Chacon-Duque, J. C.; Oppenheimer, J.; van der Valk, T.; Diez-del-Molino, D.; D. Heintzman, P.; Dalen, L.

2026-04-21 bioinformatics 10.64898/2026.04.20.719564 medRxiv
Top 2%
3.6%

Ancient DNA (aDNA) research has advanced rapidly with the development of high-throughput sequencing, which now enables genome-wide analyses of large collections of prehistoric specimens. However, analysing palaeontological and archaeological material with highly degraded DNA constitutes a major bioinformatic challenge. DNA from such samples is characterised by short fragment lengths, low endogenous content, post-mortem damage, and considerable cross-species contamination, which can increase spurious mapping and reference bias, affecting downstream population genetic inferences. Here we present DNAharvester, a modular and reproducible pipeline designed specifically for the processing of highly degraded DNA from ancient and historical specimens. DNAharvester integrates metagenomic filtering before mapping, competitive mapping, adaptive aligner selection (incorporating algorithms such as BWA-aln, BWA-mem, and Bowtie2), and systematic evaluation of reference bias and spurious mapping. By incorporating flexible mapping and filtering strategies, the pipeline can be adapted to varying sample preservation, with a distinct focus on maximising authentic data recovery from highly degraded material. Furthermore, DNAharvester features comprehensive subworkflows for iterative assembly of mitogenomes, identification of genomic repeats and CpG sites, taxonomic classification, microbial/pathogen screening of unmapped reads, genetic sex determination, and variant calling for downstream analyses. To accommodate datasets with varying sequencing depths, the pipeline incorporates multiple variant calling strategies, including diploid variant calling, genotype likelihood estimation, and pseudo-haploid random allele calling. Implemented in Nextflow, DNAharvester provides a highly scalable, containerised framework that enhances reproducibility, portability, and robustness in aDNA analyses. 
We validated the pipeline across a gradient of simulated scenarios and empirical datasets, demonstrating its ability to systematically mitigate complex background contamination while preserving authentic genomic signals even in the most challenging of circumstances. By streamlining complex bioinformatic tasks through simple configuration files, DNAharvester establishes a standardised approach for the rigorous analysis of highly degraded DNA datasets and makes genomic analyses of ancient remains accessible to the broader research community.

20
Benchmarking Agentic Large Language Models for Complex Protein-Set Functional Annotation

Zhang, X.

2026-04-21 bioinformatics 10.64898/2026.04.18.719404 medRxiv
Top 2%
3.6%

Large language model (LLM) agents are increasingly used to synthesize heterogeneous bioinformatics evidence, but their reliability for high-volume biological annotation remains poorly characterized. We evaluated three agent configurations on a controlled protein annotation task: Claude App with Claude Opus 4.7, Claude Code CLI with Claude Opus 4.7 and Claude Scientific Skills, and Codex App with GPT-5.4 and Claude Scientific Skills. Each configuration was run three times on the same verbatim prompt, the same 73 selected orthogroup FASTA files (1,705 protein sequences), and the same local evidence: Swiss-Prot BLASTP output, Pfam/HMMER domain hits, DeepTMHMM topology predictions, and SignalP secretion predictions. We audited the nine outputs for coverage, biological correctness, missing evidence, hallucinated or over-specific annotations, and within-method consistency, then merged the best-supported evidence into a final orthogroup annotation table. All nine runs covered all 73 orthogroups, indicating that the agents could retrieve and organize the complete input set. However, normalized calcification-relevance calls were only moderately reproducible: within-method exact tier agreement ranged from 0.397 to 0.685 for Claude App (mean 0.562), 0.342 to 0.740 for Claude Code (mean 0.516), and 0.411 to 0.630 for Codex App (mean 0.539), and the per-run number of high-confidence calls varied from 0 to 12 across the nine runs. The final curated table retained 3 high-confidence, 9 moderate, 18 watchlist, and 43 low-relevance orthogroups. The most robust direct candidates were sulfatase (OG0017138) and sulfotransferase (OG0020703) families and an FG-GAP/integrin-like surface protein family (OG0018986), whereas common error modes included elevating pentapeptide-repeat orthogroups on motif evidence alone, treating weakly secreted housekeeping enzymes as matrix proteins, and taking low-complexity BLAST labels at face value. 
Skill-enabled agents improved file handling, evidence traceability, and reproducibility of computational checking, but they did not eliminate biological overinterpretation. These results support a best-practice workflow in which LLM agents draft annotations only after deterministic evidence tables are generated, with explicit scoring rules, provenance columns, run-to-run replication, and expert review of high-impact claims.
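The within-method exact tier agreement reported above can be computed as the average pairwise fraction of identically tiered orthogroups across a method's replicate runs; a minimal sketch with hypothetical tiers, not the study's data:

```python
from itertools import combinations

def exact_agreement(run_a, run_b):
    """Fraction of shared orthogroups assigned the same tier in two runs."""
    shared = run_a.keys() & run_b.keys()
    return sum(run_a[k] == run_b[k] for k in shared) / len(shared)

def mean_within_method(runs):
    """Average pairwise exact agreement across a method's replicate runs."""
    pairs = list(combinations(runs, 2))
    return sum(exact_agreement(a, b) for a, b in pairs) / len(pairs)

# Three replicate runs assigning relevance tiers to the same orthogroups.
runs = [{"OG1": "high", "OG2": "low", "OG3": "watchlist"},
        {"OG1": "high", "OG2": "moderate", "OG3": "watchlist"},
        {"OG1": "high", "OG2": "low", "OG3": "low"}]
print(round(mean_within_method(runs), 3))  # 0.556
```

Values in the 0.34-0.74 range reported above mean that roughly one to two thirds of tier calls flip between runs of the same configuration, which is why the study merges runs into a curated table rather than trusting any single pass.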